Evaluating Value Weighting Schemes in the Clustering of Categorical Data

نویسندگان

  • Periklis Andritsos
  • Vassilios Tzerpos
چکیده

The majority of the algorithms in the clustering literature utilize data sets with numerical values. Recently, new and scalable algorithms have been proposed to cluster data sets with categorical data, data whose inherent ordering is not obvious. However, these algorithms deem all data values present in the data sets as equally important. Thus, the resulting clusters may be influenced by values that appear almost exclusively and reflect non-natural groupings. In this paper, we present a set of weighting schemes that allow for an objective assignment of importance on the values of a data set. We use well established weighting schemes from information retrieval, web search and data clustering to assess the importance of whole attributes and individual values. To our knowledge, this is the first work that considers weights in the clustering of categorical data. We perform clustering in the presence of importance for the values within the LIMBO framework, a new and scalable algorithm to cluster categorical data. Our experiments were performed in a variety of domains, including data sets used before in clustering research and three data sets from large software systems. We report results as to which weighting schemes show merit in the decomposition of data sets.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

ارائه یک الگوریتم خوشه بندی برای داده های دسته ای با ترکیب معیارها

Clustering is one of the main techniques in data mining. Clustering is a process that classifies data set into groups. In clustering, the data in a cluster are the closest to each other and the data in two different clusters have the most difference. Clustering algorithms are divided into two categories according to the type of data: Clustering algorithms for numerical data and clustering algor...

متن کامل

Improving categorical data clustering algorithm by weighting uncommon attribute value matches

This paper presents an improved Squeezer algorithm for categorical data clustering by giving greater weight to uncommon attribute value matches in similarity computations. Experimental results on real life datasets show that, the modified algorithm is superior to the original Squeezer algorithm and other clustering algorithm with respect to clustering accuracy.

متن کامل

A novel attribute weighting algorithm for clustering high-dimensional categorical data

Due to data sparseness and attribute redundancy in high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. To effectively address this issue, this paper presents a new optimization algorithm for clustering high-dimensional categorical data, which is an extension of the k-modes clustering algorithm. In the proposed algorithm, a novel weighting techniq...

متن کامل

Central Clustering of Categorical Data with Automated Feature Weighting

The ability to cluster high-dimensional categorical data is essential for many machine learning applications such as bioinfomatics. Currently, central clustering of categorical data is a difficult problem due to the lack of a geometrically interpretable definition of a cluster center. In this paper, we propose a novel kernel-density-based definition using a Bayes-type probability estimator. The...

متن کامل

خوشه‌بندی خودکار داده‌های مختلط با استفاده از الگوریتم ژنتیک

In the real world clustering problems, it is often encountered to perform cluster analysis on data sets with mixed numeric and categorical values. However, most existing clustering algorithms are only efficient for the numeric data rather than the mixed data set. In addition, traditional methods, for example, the K-means algorithm, usually ask the user to provide the number of clusters. In this...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006